Numba JIT Compilation

The Trap That Silently Murders Performance

This code looks like it uses Numba:

from numba import jit
import numpy as np

@jit
def compute_distance_matrix(points):
    """Compute pairwise Euclidean distance matrix."""
    n = len(points)
    result = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            dx = points[i][0] - points[j][0]
            dy = points[i][1] - points[j][1]
            result[i][j] = (dx**2 + dy**2)**0.5
    return result

points = [(float(i), float(i*2)) for i in range(1000)]  # list of tuples

import time
start = time.perf_counter()
for _ in range(10):
    dist = compute_distance_matrix(points)
elapsed = time.perf_counter() - start
print(f"Time: {elapsed:.3f}s")

Time: 5.83s

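That is slower than pure Python running the same algorithm (4.21s), and that is the trap. On Numba releases before 0.59, plain @jit silently falls back to object mode when it cannot compile the function natively, which is what happened here: points is a list of tuples rather than a NumPy array. Numba still compiled the code, but to a slow form that calls back into the interpreter and adds overhead instead of removing it.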

Now the correct version:

from numba import njit
import numpy as np
import time

@njit  # equivalent to @jit(nopython=True) - FAILS LOUDLY if types are wrong
def compute_distance_matrix_fast(points: np.ndarray) -> np.ndarray:
    """Compute pairwise Euclidean distance matrix - NumPy array input."""
    n = points.shape[0]
    result = np.zeros((n, n), dtype=np.float64)
    for i in range(n):
        for j in range(n):
            dx = points[i, 0] - points[j, 0]
            dy = points[i, 1] - points[j, 1]
            result[i, j] = (dx*dx + dy*dy)**0.5
    return result

points_np = np.array([(float(i), float(i*2)) for i in range(1000)])

# Warm up (first call compiles)
compute_distance_matrix_fast(points_np)

start = time.perf_counter()
for _ in range(10):
    dist = compute_distance_matrix_fast(points_np)
elapsed = time.perf_counter() - start
print(f"Time: {elapsed:.3f}s")

Time: 0.031s   (188x faster than the broken @jit version)

And with parallel=True:

from numba import njit, prange

@njit(parallel=True)
def compute_distance_matrix_parallel(points: np.ndarray) -> np.ndarray:
    n = points.shape[0]
    result = np.zeros((n, n), dtype=np.float64)
    for i in prange(n):          # parallel outer loop
        for j in range(n):
            dx = points[i, 0] - points[j, 0]
            dy = points[i, 1] - points[j, 1]
            result[i, j] = (dx*dx + dy*dy)**0.5
    return result

compute_distance_matrix_parallel(points_np)  # warm up

start = time.perf_counter()
for _ in range(10):
    dist = compute_distance_matrix_parallel(points_np)
elapsed = time.perf_counter() - start
print(f"Time: {elapsed:.3f}s")

Time: 0.009s   (648x faster than the broken @jit version, 8-core machine)

Version	Time	Notes
`@jit` with list of tuples	5.83 s	Object mode - DO NOT USE
Pure Python (no Numba)	4.21 s	Baseline
`@njit` with NumPy array	0.031 s	136x over Python
`@njit(parallel=True)` 8 cores	0.009 s	468x over Python

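On Numba releases before 0.59, the @jit decorator without nopython=True is a footgun. Since 0.59, nopython=True is the default for @jit and the silent fallback is gone, but @njit (or an explicit @jit(nopython=True)) still states the intent clearly and behaves the same on every version. Always use one of the two.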
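
One way to keep list-of-tuples input away from a compiled function is to convert and validate it at the call boundary. The sketch below is a minimal illustration; the helper name as_point_array is hypothetical and not part of Numba or NumPy.

```python
import numpy as np


def as_point_array(points):
    """Convert (x, y) pairs to a C-contiguous float64 array of shape (n, 2)."""
    arr = np.ascontiguousarray(points, dtype=np.float64)
    if arr.ndim != 2 or arr.shape[1] != 2:
        raise ValueError(f"expected shape (n, 2), got {arr.shape}")
    return arr


points = [(float(i), float(i * 2)) for i in range(5)]  # list of tuples
points_np = as_point_array(points)                     # safe to pass to an @njit function
print(points_np.dtype, points_np.shape)
```

Converting once, outside the hot loop, also means the conversion cost is paid a single time rather than on every call.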

What You Will Learn

Understand how Numba compiles Python to LLVM IR and native machine code
Use @njit and cache=True correctly for persistent compilation caching
Parallelise loops automatically with @njit(parallel=True) and prange
Understand Numba's type system and avoid the object mode trap
Create NumPy ufuncs from Python functions with @vectorize
Use @guvectorize for sliding-window and reduction operations
Write GPU kernels with @cuda.jit (optional - requires CUDA GPU)
Choose between Numba, Cython, and NumPy for different workloads

Prerequisites

Requirement	Level Needed
NumPy arrays and indexing	Comfortable
Python decorators	Comfortable
Basic threading/parallelism ideas	Helpful
CUDA concepts (Section 7 only)	Helpful

Section 1: How Numba Works

Numba is a just-in-time (JIT) compiler. Unlike Cython (which compiles to C at build time), Numba compiles Python functions to native machine code the first time they are called with specific argument types, using the LLVM compiler infrastructure.

First call: compute_distance_matrix_fast(points_np)
    │
    ▼
Numba inspects argument types:
    points → numpy.ndarray, dtype=float64, ndim=2
    │
    ▼
Numba generates LLVM IR (intermediate representation):
    define double* @compute_distance_matrix_fast(...) {
      %n = ...
      for.loop:
        %dx = fsub double %xi, %xj
        ...
    }
    │
    ▼
LLVM optimises and compiles to x86-64 machine code
    │
    ▼
Compiled function is cached (in memory and optionally on disk)
    │
    ▼
Subsequent calls: directly execute native code - zero Python overhead

`nopython` Mode vs `object` Mode

Mode	Behaviour	Performance
`nopython`	All types resolved to C types, no Python object access	Native C speed
`object`	Falls back to Python interpreter for unknown types	Little or no speedup; can be slower than pure Python

In object mode, Numba still compiles the function but inserts calls back to the Python interpreter wherever it encounters types it does not understand. You pay the compilation overhead (the first call is slow), and the run time can end up worse than pure Python because every operation is dispatched through Numba's object layer on top of the interpreter.

Always use @njit (nopython=True). If Numba cannot compile in nopython mode, you get a clear error message at first call - not silent slowness.

Section 2: `@njit` and Compilation Caching

Basic `@njit`

from numba import njit
import numpy as np
import time

@njit
def exponential_moving_average(data: np.ndarray, alpha: float) -> np.ndarray:
    """
    Exponential moving average - cannot be vectorised with NumPy alone
    because each output depends on the previous one.
    """
    n = data.shape[0]
    result = np.empty(n, dtype=np.float64)
    result[0] = data[0]
    for i in range(1, n):
        result[i] = alpha * data[i] + (1.0 - alpha) * result[i - 1]
    return result

# Measure compilation time vs execution time
data = np.random.randn(1_000_000)

t0 = time.perf_counter()
out = exponential_moving_average(data, 0.1)  # includes compilation
t1 = time.perf_counter()
out = exponential_moving_average(data, 0.1)  # compiled - fast
t2 = time.perf_counter()

print(f"First call (compile + run): {(t1-t0)*1000:.1f} ms")
print(f"Second call (run only):     {(t2-t1)*1000:.3f} ms")

First call (compile + run): 847.3 ms
Second call (run only):       0.8 ms

The compilation cost (847ms) is paid once per Python interpreter session per unique type signature. This is the JIT warmup cost.
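
To keep the warmup cost out of benchmark numbers, it helps to time the first call separately from later calls. The helper below is a small sketch using only the standard library; the name time_first_and_steady is hypothetical, and it is demonstrated on the built-in sum so it runs without Numba.

```python
import time


def time_first_and_steady(fn, *args, repeats=5):
    """Return (first_call_seconds, best_steady_state_seconds) for fn(*args)."""
    t0 = time.perf_counter()
    fn(*args)                      # for a JIT function this includes compilation
    first = time.perf_counter() - t0

    steady = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        steady = min(steady, time.perf_counter() - t0)
    return first, steady


first, steady = time_first_and_steady(sum, range(100_000))
print(f"first: {first * 1e3:.3f} ms, steady: {steady * 1e3:.3f} ms")
```

For a JIT-compiled function, the gap between the two numbers is roughly the compilation time for that type signature.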

`cache=True` - Persistent Disk Cache

@njit(cache=True)  # save compiled code to __pycache__/
def exponential_moving_average(data: np.ndarray, alpha: float) -> np.ndarray:
    n = data.shape[0]
    result = np.empty(n, dtype=np.float64)
    result[0] = data[0]
    for i in range(1, n):
        result[i] = alpha * data[i] + (1.0 - alpha) * result[i - 1]
    return result

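With cache=True, Numba writes the compiled function to an on-disk cache, by default a __pycache__/ directory next to the source file. On the next interpreter startup, Numba loads the cached code instead of recompiling. The first call is still slightly slower than later calls (the cache has to be loaded), but it no longer costs 847ms.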

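The cache is invalidated automatically when the function's source file changes. Changes to functions it calls in other files may not be detected, so clear the cache directory if results look stale.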
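
If the source directory is read-only (common in containers), the cache can be redirected with the NUMBA_CACHE_DIR environment variable. The sketch below assumes a hypothetical directory under the system temp folder, and sets the variable before Numba is imported so the setting is in place when Numba reads its configuration.

```python
import os
import tempfile

# Hypothetical location; any writable directory works.
cache_dir = os.path.join(tempfile.gettempdir(), "numba_cache_example")
os.makedirs(cache_dir, exist_ok=True)

# Set this before Numba is imported anywhere in the process.
os.environ.setdefault("NUMBA_CACHE_DIR", cache_dir)
print(os.environ["NUMBA_CACHE_DIR"])
```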

Inspecting Compiled Types

# After the function has been called at least once:
exponential_moving_average.inspect_types()

Output:

exponential_moving_average (array(float64, 1d, C), float64)
--------------------------------------------------------------------------------
# File: ema.py
# --- LINE 5 ---
# label 0
#   data = arg(0, name=data)  :: array(float64, 1d, C)
#   alpha = arg(1, name=alpha)  :: float64
#   n = data.shape[0]           :: int64
#   result = np.empty(n, ...)   :: array(float64, 1d, C)
#   result[0] = data[0]         :: (store)

This shows exactly what types Numba inferred for each variable. If any variable shows pyobject, it is in object mode and will be slow.

Multiple Type Specialisations

@njit(cache=True)
def add_scalar(arr, val):
    result = np.empty_like(arr)
    for i in range(arr.shape[0]):
        result[i] = arr[i] + val
    return result

a_f32 = np.ones(100, dtype=np.float32)
a_f64 = np.ones(100, dtype=np.float64)
a_i64 = np.ones(100, dtype=np.int64)

add_scalar(a_f32, np.float32(1.0))  # compiles for (float32, float32)
add_scalar(a_f64, 1.0)              # compiles for (float64, float64)
add_scalar(a_i64, 1)                # compiles for (int64, int64)

# Each specialisation is separate compiled code
print(add_scalar.signatures)
# [(array(float32, 1d, C), float32),
#  (array(float64, 1d, C), float64),
#  (array(int64, 1d, C), int64)]

Each unique combination of argument types triggers a separate compilation. Numba caches all specialisations.
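
The idea of one compiled version per signature can be pictured with a toy model in plain Python. This is only an illustration of the bookkeeping: ToyDispatcher is a made-up class, it compiles nothing, and Numba's real dispatcher also distinguishes array dtype, dimensionality and memory layout.

```python
class ToyDispatcher:
    """Toy model of per-signature specialisation (not Numba's implementation)."""

    def __init__(self, py_func):
        self.py_func = py_func
        self.specialisations = {}   # signature -> "compiled" callable
        self.compile_count = 0

    def _signature(self, args):
        # Stand-in for type inference: use the Python type names of the arguments.
        return tuple(type(a).__name__ for a in args)

    def __call__(self, *args):
        sig = self._signature(args)
        if sig not in self.specialisations:
            self.compile_count += 1              # stands in for the slow compile step
            self.specialisations[sig] = self.py_func
        return self.specialisations[sig](*args)


@ToyDispatcher
def add(a, b):
    return a + b


add(1, 2)        # new signature ('int', 'int')
add(3, 4)        # reuses the existing specialisation
add(1.0, 2.0)    # new signature ('float', 'float')
print(add.compile_count, list(add.specialisations))
```

Three calls trigger only two "compilations", which mirrors why warmup time grows with the number of distinct signatures rather than the number of calls.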

Section 3: `@njit(parallel=True)` and `prange`

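Numba's parallel mode automatically parallelises loops across CPU cores. The work is scheduled by a threading layer: Intel TBB (Threading Building Blocks), OpenMP, or Numba's built-in workqueue, depending on what is available. Unlike Cython's prange, which needs OpenMP compiler and linker flags in the build, Numba handles thread management transparently.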

from numba import njit, prange
import numpy as np

@njit(parallel=True, cache=True)
def parallel_rolling_sum(data: np.ndarray, window: int) -> np.ndarray:
    """
    Parallel rolling sum - each output position is independent.
    prange distributes iterations across CPU cores automatically.
    """
    n = data.shape[0]
    out_len = n - window + 1
    result = np.empty(out_len, dtype=np.float64)

    for i in prange(out_len):     # parallel: iterations are divided among threads
        total = 0.0
        for j in range(window):
            total += data[i + j]
        result[i] = total

    return result

Parallel Reduction

@njit(parallel=True)
def parallel_sum(arr: np.ndarray) -> float:
    """
    Parallel sum using Numba's automatic reduction detection.
    Numba detects that 'total += arr[i]' is a reduction and
    creates thread-private totals that are combined at the end.
    """
    total = 0.0
    for i in prange(arr.shape[0]):
        total += arr[i]
    return total
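
The thread-private reduction described in the docstring can be modelled on the CPU with plain NumPy: split the array into one chunk per worker, sum each chunk, then merge the partial totals. This is a sketch of the idea rather than Numba's scheduling; chunked_sum and n_workers are illustrative names.

```python
import numpy as np


def chunked_sum(arr, n_workers=4):
    """Sum arr the way a parallel reduction does: per-chunk partials, then merge."""
    chunks = np.array_split(arr, n_workers)          # one chunk per hypothetical worker
    partials = [float(np.sum(c)) for c in chunks]    # each worker's private total
    return sum(partials)                             # final merge step


arr = np.random.default_rng(0).standard_normal(10_000)

serial = 0.0
for x in arr:                 # plain left-to-right accumulation
    serial += x

print(np.isclose(chunked_sum(arr), serial))
```

Because the additions happen in a different order, the parallel total can differ from the serial one in the last few bits, which is why the comparison uses a tolerance instead of ==.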

Parallel Image Filter

@njit(parallel=True, cache=True)
def gaussian_blur_2d(
    image: np.ndarray,     # shape: (H, W)
    kernel: np.ndarray,    # shape: (k, k)
) -> np.ndarray:
    """
    Apply a 2D convolution kernel to an image.
    Outer loop (rows) is parallelised across threads.
    """
    H, W = image.shape
    k = kernel.shape[0]
    pad = k // 2
    result = np.zeros_like(image)

    for row in prange(pad, H - pad):     # parallel over rows
        for col in range(pad, W - pad):
            val = 0.0
            for ki in range(k):
                for kj in range(k):
                    val += image[row - pad + ki, col - pad + kj] * kernel[ki, kj]
            result[row, col] = val

    return result

# Usage
import numpy as np
from numba import njit, prange

image = np.random.rand(1080, 1920).astype(np.float64)
kernel = np.ones((5, 5), dtype=np.float64) / 25.0  # box blur

# Warm up
gaussian_blur_2d(image, kernel)

import time
start = time.perf_counter()
for _ in range(10):
    blurred = gaussian_blur_2d(image, kernel)
elapsed = time.perf_counter() - start
print(f"10 runs: {elapsed:.3f}s  ({elapsed/10*1000:.1f}ms each)")

10 runs: 0.234s  (23.4ms each) - on 8 cores

Equivalent Python nested loops: ~45 seconds per iteration.

`prange` Safety Rules

prange is safe only when:

Iterations are independent - result[i] does not depend on result[i-1]
No shared write locations (write races)
Reductions are to a single scalar variable (Numba handles these automatically)

Unsafe example:

@njit(parallel=True)
def UNSAFE_cumsum(arr):
    result = np.zeros_like(arr)
    for i in prange(arr.shape[0]):
        result[i] = result[i-1] + arr[i]  # RACE: reads result[i-1] while another thread writes it
    return result

This produces silently wrong answers without raising an error.
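
Since no error is raised, the practical defence is to compare against a serial reference. The sketch below uses a deterministic NumPy model of one way the unsafe loop can go wrong (each worker scans its chunk without the running total from the previous chunk); real race outcomes vary from run to run. Both function names are hypothetical.

```python
import numpy as np


def chunked_cumsum_no_carry(arr, n_workers=4):
    """Model of a racy scan: each worker scans its chunk without the carry-in."""
    return np.concatenate([np.cumsum(c) for c in np.array_split(arr, n_workers)])


def matches_reference(candidate, reference, arr):
    """True when candidate(arr) agrees with the serial reference within tolerance."""
    return bool(np.allclose(candidate(arr), reference(arr)))


arr = np.arange(1.0, 9.0)                       # 1.0 .. 8.0
print(matches_reference(chunked_cumsum_no_carry, np.cumsum, arr))   # broken version
print(matches_reference(np.cumsum, np.cumsum, arr))                 # serial reference
```

Running a check like this on several input sizes before trusting a prange loop catches dependency bugs that would otherwise pass unnoticed.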

Section 4: Numba's Type System

Understanding what Numba can and cannot compile is critical to avoiding object mode.

Supported Types

Python/NumPy Type	Numba Support	Notes
`int`, `float`, `bool`	Full	Mapped to C `int64`, `double`, `bool`
`complex`	Full	`complex128`
`numpy.ndarray`	Full	Any dtype, any ndim
`numpy.float64` scalars	Full	Used as C doubles
`tuple` (homogeneous)	Partial	Fixed-length only
`list` (homogeneous)	Partial	Reflected lists - use NumPy arrays
`dict`	Partial (typed)	`numba.typed.Dict` only
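`str`	Partial	Unicode strings compile in nopython mode (since Numba 0.41), but many operations are missing or slow
`bytes`	Partial	Indexing, iteration and len() only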
Arbitrary Python objects	No	Triggers object mode
`pandas.DataFrame`	No	Extract NumPy arrays first
`datetime.datetime`	Partial	Via `numpy.datetime64`

What Does NOT Work in `nopython` Mode

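from numba import njit

@njit
def FAILS_1(data):
    return f"{data[0]:.2f}"  # f-string format specs not supported in nopython

@njit
def FAILS_2(path):
    fh = open(path)       # file I/O not supported in nopython
    return fh.read()

@njit
def FAILS_3(data):
    import json           # imports not supported in nopython
    return json.dumps(data.tolist())

class Config:
    """Plain Python class: Numba has no native type for its instances."""
    scale = 2.0

@njit
def FAILS_4(data, cfg):
    return data * cfg.scale  # arbitrary Python object argument cannot be typed

All of these raise TypingError or another NumbaError subclass at first call - which is the correct behaviour. You get the error immediately, not silent slowness. Support expands between releases, so check the supported-features list for your Numba version before assuming a construct is rejected.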

Numba Typed Containers

For use cases that require mutable containers inside JIT functions:

from numba import njit
from numba.typed import List, Dict
import numba
import numpy as np

@njit
def build_filtered_list(data: np.ndarray, threshold: float):
    """Build a filtered list inside a JIT function using Numba's typed List."""
    result = List()
    result.append(0.0)   # type inference from first append
    result.pop()

    for i in range(data.shape[0]):
        if data[i] > threshold:
            result.append(data[i])

    return result

@njit
def count_by_bucket(data: np.ndarray, n_buckets: int):
    """Histogram using Numba's typed Dict."""
    counts = Dict.empty(
        key_type=numba.int64,
        value_type=numba.int64,
    )
    for i in range(data.shape[0]):
        bucket = int(data[i]) % n_buckets
        if bucket in counts:
            counts[bucket] += 1
        else:
            counts[bucket] = 1
    return counts

Note: In most cases, using a pre-allocated NumPy array instead of a Numba typed container is faster and simpler.
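
As an illustration of that note, the bucket count above can be written with a pre-allocated int64 array, because the keys are small non-negative integers. This is a sketch, not a drop-in replacement: the function name is hypothetical, and the try/except fallback exists only so the example still runs where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # fallback so the sketch still runs without Numba installed
    def njit(func):
        return func


@njit
def count_by_bucket_array(data, n_buckets):
    """Histogram into a pre-allocated int64 array instead of a typed Dict."""
    counts = np.zeros(n_buckets, dtype=np.int64)
    for i in range(data.shape[0]):
        bucket = int(data[i]) % n_buckets
        counts[bucket] += 1
    return counts


data = np.array([0.5, 1.2, 1.9, 2.0, 5.7])
print(count_by_bucket_array(data, 3))
```

The array version avoids hashing entirely and returns a plain NumPy array, which is usually easier to consume on the Python side than a typed Dict.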

Section 5: `@vectorize` - Creating NumPy Ufuncs

@vectorize creates a NumPy universal function (ufunc) from a Python function that operates on scalar values. The resulting ufunc broadcasts automatically over arrays of any shape, just like np.sin, np.exp, etc.

from numba import vectorize
import numpy as np

# @vectorize takes a list of type signatures: output(input1, input2, ...)
@vectorize(['float64(float64)', 'float32(float32)'])
def leaky_relu(x):
    """Leaky ReLU activation - operates on a single scalar."""
    return x if x > 0.0 else 0.01 * x


@vectorize(['float64(float64, float64)'])
def huber_loss(prediction, target):
    """
    Huber loss - robust to outliers.
    δ = 1.0
    L = 0.5*(pred-tgt)² if |pred-tgt| ≤ 1 else |pred-tgt| - 0.5
    """
    diff = abs(prediction - target)
    if diff <= 1.0:
        return 0.5 * diff * diff
    return diff - 0.5


# These functions now work on arrays of any shape
predictions = np.random.randn(1_000_000)
targets = np.random.randn(1_000_000)

losses = huber_loss(predictions, targets)   # vectorised over 1M elements
print(losses.shape)   # (1000000,)
print(losses.mean())
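
A ufunc built this way follows NumPy's broadcasting rules, so a plain NumPy reference is a convenient cross-check for both values and output shapes. The sketch below assumes delta = 1.0 as in the docstring above; huber_reference is an illustrative name and runs without Numba.

```python
import numpy as np


def huber_reference(prediction, target):
    """NumPy reference for the Huber loss with delta = 1.0."""
    diff = np.abs(prediction - target)
    return np.where(diff <= 1.0, 0.5 * diff * diff, diff - 0.5)


pred = np.array([[0.0], [1.0], [3.0]])      # shape (3, 1)
tgt = np.array([0.0, 0.5, 2.0, 3.0])        # shape (4,)
out = huber_reference(pred, tgt)            # broadcasts to shape (3, 4)
print(out.shape)
print(out[0])
```

Calling the compiled huber_loss on the same pred and tgt should broadcast to the same shape, and np.allclose against this reference is a quick correctness test.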

Parallel Ufunc

@vectorize(['float64(float64, float64)'], target='parallel')
def huber_loss_parallel(prediction, target):
    """Same function, parallel execution across CPU cores."""
    diff = abs(prediction - target)
    if diff <= 1.0:
        return 0.5 * diff * diff
    return diff - 0.5

# target options: 'cpu' (default), 'parallel' (multi-core), 'cuda' (GPU)

Custom Activation Functions - A Practical Example

from numba import vectorize
import numpy as np

@vectorize(['float64(float64, float64, float64)'], target='parallel', cache=True)
def parametric_relu(x, alpha, threshold):
    """PReLU with configurable negative slope and threshold."""
    if x >= threshold:
        return x
    return alpha * (x - threshold)


@vectorize(['float64(float64)'], target='parallel', cache=True)
def selu(x):
    """
    Scaled Exponential Linear Unit - requires exact constants
    for self-normalising property.
    """
    alpha = 1.6732632423543772
    scale = 1.0507009873554805
    if x > 0.0:
        return scale * x
    return scale * alpha * (2.718281828459045 ** x - 1.0)


# Benchmark vs NumPy equivalent
data = np.random.randn(5_000_000)

import time

# NumPy implementation
def selu_numpy(x):
    alpha = 1.6732632423543772
    scale = 1.0507009873554805
    return np.where(x > 0, scale * x, scale * alpha * (np.exp(x) - 1.0))

start = time.perf_counter()
for _ in range(20):
    out = selu_numpy(data)
print(f"NumPy SELU:    {(time.perf_counter()-start)/20*1000:.2f}ms")

selu(data)  # warm up
start = time.perf_counter()
for _ in range(20):
    out = selu(data)
print(f"Numba SELU:    {(time.perf_counter()-start)/20*1000:.2f}ms")

NumPy SELU:    48.23ms
Numba SELU:    12.41ms  (parallel, 4-core)

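The NumPy version allocates a full-size temporary array for each intermediate step (the x > 0 mask, scale * x, np.exp(x), and the arithmetic applied to it) before np.where builds the result, and it evaluates both branches for every element. Numba's ufunc fuses all operations into a single pass: one read per element, one write per element, no temporaries.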

Section 6: `@guvectorize` - Generalised Ufuncs

@guvectorize extends @vectorize to operations that consume or produce arrays of a fixed shape per "element". It is the right tool for sliding windows, normalisation along an axis, and small matrix operations.

from numba import guvectorize
import numpy as np

# Layout string: (n),(n)->(n) means:
# first input is 1D length-n, second input is 1D length-n, output is 1D length-n
@guvectorize(
    ['void(float64[:], float64[:], float64[:])'],
    '(n),(n)->(n)',
    target='parallel',
    cache=True,
)
def normalise_to_reference(signal, reference, out):
    """
    Normalise each channel of signal relative to the corresponding reference.
    Operates on 1D slices; guvectorize broadcasts over higher dimensions.
    """
    n = signal.shape[0]
    ref_sum = 0.0
    for i in range(n):
        ref_sum += reference[i]

    if ref_sum == 0.0:
        for i in range(n):
            out[i] = 0.0
    else:
        for i in range(n):
            out[i] = signal[i] / ref_sum


# Sliding window operation
@guvectorize(
    ['void(float64[:], float64[:])'],
    '(n)->()',      # n-element input → scalar output
    target='parallel',
    cache=True,
)
def window_max(window, out):
    """Maximum value over a sliding window."""
    m = window[0]
    for i in range(1, window.shape[0]):
        if window[i] > m:
            m = window[i]
    out[0] = m
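
A '(n)->()' gufunc reduces the last axis of whatever it is given, so the sliding windows themselves have to be supplied by the caller. One way to build them without copying is NumPy's sliding_window_view (available from NumPy 1.20). The sketch below uses NumPy's own max as the reference result, so it runs without Numba.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

signal = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
windows = sliding_window_view(signal, window_shape=3)   # shape (3, 3), a view, no copy
print(windows.shape)

# Reference result: reducing the last axis, one value per window.
reference = windows.max(axis=-1)
print(reference)
```

Passing windows to window_max should give the same values as reference, with the gufunc looping over the leading axis and handing each length-3 row to the kernel.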

Section 7: Numba for CUDA - GPU Kernels (Optional)

Numba's CUDA support lets you write GPU kernels in Python. This section requires a CUDA-capable NVIDIA GPU and the CUDA toolkit installed.

# Check if CUDA is available
python -c "from numba import cuda; print(cuda.gpus)"

The CUDA Programming Model

CPU (Host)                    GPU (Device)
─────────                     ─────────────
Python code runs here         Parallel kernel runs here

                              Grid
                              ┌──────────────────────┐
                              │  Block (0,0)          │
                              │  ┌───┬───┬───┬───┐   │
                              │  │T0 │T1 │T2 │T3 │   │  ← threads execute in parallel
                              │  └───┴───┴───┴───┘   │
                              ├──────────────────────┤
                              │  Block (1,0)          │
                              │  ┌───┬───┬───┬───┐   │
                              │  │T0 │T1 │T2 │T3 │   │
                              │  └───┴───┴───┴───┘   │
                              └──────────────────────┘

Each CUDA kernel invocation launches blocks × threads_per_block parallel threads. Each thread knows its position via cuda.threadIdx and cuda.blockIdx.
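
The index arithmetic can be checked on the CPU before any GPU code is written. The sketch below is a pure-Python model of a 1D launch (simulate_launch is a hypothetical helper, not a Numba API): it enumerates every block and thread, computes the global index, and applies the bounds check that a kernel needs when the thread count exceeds the array length.

```python
import math


def simulate_launch(n, threads_per_block):
    """Enumerate the global indices a 1D launch would produce (CPU model only)."""
    blocks_per_grid = math.ceil(n / threads_per_block)
    touched = []
    for block_idx in range(blocks_per_grid):
        for thread_idx in range(threads_per_block):
            global_idx = thread_idx + block_idx * threads_per_block
            if global_idx < n:            # same bounds check a kernel needs
                touched.append(global_idx)
    return blocks_per_grid, touched


blocks, touched = simulate_launch(n=10, threads_per_block=4)
print(blocks, blocks * 4, touched == list(range(10)))
```

With 10 elements and 4 threads per block, 3 blocks are launched (12 threads), and the bounds check discards the 2 surplus threads so each element is visited exactly once.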

GPU Vector Addition

from numba import cuda
import numpy as np
import math

@cuda.jit
def vector_add_gpu(a, b, c):
    """
    GPU kernel: each thread adds one element.
    Thread index determines which element this thread processes.
    """
    # Compute this thread's global index
    thread_id = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x

    if thread_id < a.shape[0]:  # bounds check - total threads may exceed array size
        c[thread_id] = a[thread_id] + b[thread_id]


def add_on_gpu(a_np: np.ndarray, b_np: np.ndarray) -> np.ndarray:
    """Transfer arrays to GPU, run kernel, transfer result back."""
    n = a_np.shape[0]

    # Transfer to GPU memory
    a_gpu = cuda.to_device(a_np)
    b_gpu = cuda.to_device(b_np)
    c_gpu = cuda.device_array(n, dtype=np.float64)

    # Configure launch: 256 threads per block, enough blocks for all elements
    threads_per_block = 256
    blocks_per_grid = math.ceil(n / threads_per_block)

    # Launch kernel
    vector_add_gpu[blocks_per_grid, threads_per_block](a_gpu, b_gpu, c_gpu)

    # Transfer result back to CPU
    return c_gpu.copy_to_host()


# Test
n = 10_000_000
a = np.random.rand(n)
b = np.random.rand(n)

result_gpu = add_on_gpu(a, b)
result_cpu = a + b

assert np.allclose(result_gpu, result_cpu)
print("GPU result matches CPU result.")

GPU Matrix Multiplication

from numba import cuda
import numba
import numpy as np
import math

BLOCK_SIZE = 16   # 16×16 = 256 threads per block

@cuda.jit
def matmul_gpu(A, B, C):
    """
    GPU matrix multiplication using shared memory tiling.
    Each thread block computes a BLOCK_SIZE × BLOCK_SIZE tile of C.
    Assumes square n × n matrices (all bounds checks use n = A.shape[0]).
    """
    # Shared memory tiles - allocated per block
    tile_A = cuda.shared.array(shape=(BLOCK_SIZE, BLOCK_SIZE), dtype=numba.float64)
    tile_B = cuda.shared.array(shape=(BLOCK_SIZE, BLOCK_SIZE), dtype=numba.float64)

    row = cuda.blockIdx.y * BLOCK_SIZE + cuda.threadIdx.y
    col = cuda.blockIdx.x * BLOCK_SIZE + cuda.threadIdx.x

    n = A.shape[0]
    tmp = 0.0

    for tile_idx in range((n + BLOCK_SIZE - 1) // BLOCK_SIZE):  # integer ceiling division
        # Load tiles into shared memory
        tr = cuda.threadIdx.y
        tc = cuda.threadIdx.x

        if row < n and tile_idx * BLOCK_SIZE + tc < n:
            tile_A[tr, tc] = A[row, tile_idx * BLOCK_SIZE + tc]
        else:
            tile_A[tr, tc] = 0.0

        if tile_idx * BLOCK_SIZE + tr < n and col < n:
            tile_B[tr, tc] = B[tile_idx * BLOCK_SIZE + tr, col]
        else:
            tile_B[tr, tc] = 0.0

        # Wait for all threads in the block to finish loading
        cuda.syncthreads()

        # Compute the dot product for this tile
        for k in range(BLOCK_SIZE):
            tmp += tile_A[tr, k] * tile_B[k, tc]

        # Wait before loading next tile
        cuda.syncthreads()

    if row < n and col < n:
        C[row, col] = tmp
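
The tiling logic is easier to verify on the CPU first. The sketch below is a NumPy model of the same algorithm for square matrices: TILE stands in for BLOCK_SIZE, the two outer loops play the role of the block indices, and slice clipping at the matrix edge plays the role of the zero-padding branches. It illustrates the access pattern only and says nothing about GPU performance.

```python
import numpy as np

TILE = 4  # stands in for BLOCK_SIZE; small so the example is easy to trace


def tiled_matmul(A, B, tile=TILE):
    """CPU model of tiled multiplication for square n x n matrices."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=np.float64)
    n_tiles = (n + tile - 1) // tile             # integer ceiling division
    for bi in range(n_tiles):                    # tile row of C (blockIdx.y)
        for bj in range(n_tiles):                # tile column of C (blockIdx.x)
            r0, c0 = bi * tile, bj * tile
            for t in range(n_tiles):             # march along the shared dimension
                k0 = t * tile
                # Slices clip at the matrix edge, so partial tiles need no padding.
                tile_a = A[r0:r0 + tile, k0:k0 + tile]
                tile_b = B[k0:k0 + tile, c0:c0 + tile]
                C[r0:r0 + tile, c0:c0 + tile] += tile_a @ tile_b
    return C


rng = np.random.default_rng(0)
A = rng.random((6, 6))      # 6 is not a multiple of TILE, so edge tiles are partial
B = rng.random((6, 6))
print(np.allclose(tiled_matmul(A, B), A @ B))
```

Using a size that is not a multiple of the tile width exercises the edge handling, which is where tiled kernels most often go wrong.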

When CUDA Helps (and When It Does Not)

Workload	GPU Benefit	Notes
Large matrix operations (N > 1000)	10–100x	Memory bandwidth becomes limiting
Batch inference (deep learning)	10–100x	Use CUDA via PyTorch/TensorFlow, not @cuda.jit
Monte Carlo simulation (millions of trials)	10–50x	Embarrassingly parallel
Image processing (large batches)	10–50x	Pixel operations are embarrassingly parallel
Small arrays (N < 100)	Slower	GPU launch overhead dominates
Sequential algorithms (depends on prev output)	Little benefit	Not parallelisable
I/O-bound work	No benefit	CPU waits on disk/network, not computation

Section 8: Numba vs Cython vs NumPy - Decision Table

Choose based on your situation, not on what you've used before:

Criterion	NumPy	Numba @njit	Cython
Build system required	No	No	Yes (C compiler)
Zero-overhead at import time	Yes	No (first-call JIT)	Yes (pre-compiled)
Arbitrary loop logic	Hard (vectorise it)	Yes	Yes
Non-numerical Python objects	No	Limited	Yes (with overhead)
GIL release	Yes (some ops)	Yes (nogil=True)	Yes (with nogil:)
GPU support	No (use CuPy)	Yes (@cuda.jit)	No
Parallel CPU	Via BLAS	Yes (prange)	Yes (prange+OpenMP)
C library integration	No	No	Yes (cdef extern)
Debugging ease	High	Medium	Low (C errors)
Works with PyPy	Yes	No	No
Complex data structures in compiled code	N/A	With typed Dict/List	Yes

Decision Rule

Is the bottleneck a tight numerical loop over arrays?
  YES → Try Numba @njit first (zero build complexity)
        If Numba cannot handle the types → use Cython memoryviews
  NO  → Is the bottleneck expressible as array operations?
          YES → NumPy vectorisation
          NO  → Profile more carefully - may not be CPU-bound

Interview Questions

Q1: What is the difference between @jit and @njit in Numba? Why should you almost always prefer @njit?

On Numba releases before 0.59, @jit defaults to nopython=False. When Numba encounters something it cannot compile natively (arbitrary Python objects, unsupported library calls), it silently falls back to object mode: a compiled but slow form that still uses the Python interpreter for unsupported operations. The function compiles (and pays the compilation overhead) but can run slower than pure Python because of the redundant dispatch. Since 0.59, @jit defaults to nopython=True and the silent fallback has been removed.

@njit is @jit(nopython=True). It raises a TypingError immediately at first call if any variable type cannot be resolved to a native C type. There is no silent fallback.

You should almost always prefer @njit because:

Silent object mode fallback is a performance footgun - you pay compile overhead but get no speedup
The error message from TypingError tells you exactly which type caused the problem
It enforces discipline around Numba-compatible types upfront

The only reason to use object mode (plain @jit on older releases, @jit(forceobj=True) on current ones) is during prototyping, when you want partial compilation while you gradually make types compatible.

Q2: Explain Numba's type specialisation. What happens when you call a @njit function with different argument types?

Numba compiles a separate native code version for each unique combination of argument types. The first time f(arr_f64) is called, Numba compiles a version specialised for float64 arrays. The first time f(arr_f32) is called, Numba compiles another version specialised for float32 arrays. These specialisations are stored in a dispatch table.

Subsequent calls with the same types execute the already-compiled native code directly. The only remaining overhead is a type lookup in the dispatch table (O(1)) and the conversion of the arguments to their native representation.

This means:

First call per type signature pays the compilation cost (often hundreds of milliseconds)
Subsequent calls are as fast as optimised C code
Many different type signatures = many separate compilations = longer warmup time

With cache=True, compiled specialisations are saved to __pycache__ and loaded on subsequent interpreter starts, which removes most of the warmup cost after the first run.

Q3: What are the safety requirements for using prange in Numba? What goes wrong if you violate them?

prange distributes loop iterations across CPU threads. The safety requirements are:

Independence: result[i] must not depend on result[i-1] (or any value computed in another iteration). If iteration i=5 reads result[4] while the thread computing result[4] has not finished yet, you get a data race: undefined behaviour, silently wrong results.
No conflicting writes: Two iterations must not write to the same memory location. If they do, whichever thread writes last wins - the other write is lost.
Reductions are safe but must be on a single scalar variable: total += arr[i] inside a prange loop is automatically detected as a reduction. Numba creates thread-private copies of total, accumulates locally, and merges at the end. This is correct.
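No Python objects: the loop body runs as native code on worker threads, so it must not touch Python objects. Nopython mode already rejects most such code at compile time. Containers that do compile, such as a typed List, are not thread-safe, and appending to one from several iterations at once can crash or corrupt data.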

Violations produce silently wrong results - not exceptions. This is the most dangerous aspect of prange. Test parallel code against a serial reference implementation with np.allclose() before trusting it.

Q4: What is the difference between @vectorize and @guvectorize? Give a use case for each.

@vectorize creates a ufunc from a function that takes scalar inputs and returns a scalar output. The scalar function is applied element-wise to arrays, with automatic broadcasting. Example use cases: activation functions (ReLU, SELU), element-wise loss functions (Huber loss), custom clipping functions.

@guvectorize creates a generalised ufunc from a function that takes arrays of specified shapes as inputs and outputs. The layout string specifies the shape contract: '(n)->()' means "take a 1D array of length n, produce a scalar". Example use cases: sliding window aggregation (max, mean over a window), normalisation of each row/column of a matrix, dot product (single row-column pair → scalar).

The key difference: @vectorize processes one scalar per call; @guvectorize processes one array slice per call. For a sliding window maximum over a 2D input (rows, window_size), @vectorize cannot express this naturally (it only handles scalars), but @guvectorize with layout '(n)->()' applies the scalar-producing function to each row independently and broadcasts over the batch dimension.

Q5: You have a function decorated with @njit that processes financial tick data. In production, the first request each morning takes 3 seconds to respond because Numba is recompiling. How do you fix this?

There are three complementary approaches:

1. cache=True - the simplest fix. @njit(cache=True) saves the compiled function to an on-disk cache (__pycache__/ by default). On the next interpreter start, the cache is loaded (fast) instead of recompiling. The first call is still slightly slower than later ones because of the cache load, but not seconds slower. This eliminates the 3-second compile on every restart.

2. Warmup at startup - call the function with representative dummy data during application startup (before accepting requests), so the compilation happens during the startup phase rather than on the first real request:

# In app startup (before accepting traffic)
import numpy as np
_dummy = np.zeros((1, 10), dtype=np.float64)
process_tick_data(_dummy)   # triggers compilation

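3. Ahead-of-time compilation with numba.pycc - Numba supports pre-compiling modules to .so files that can be imported like any C extension, eliminating all JIT overhead. Note that numba.pycc has been deprecated since Numba 0.57, so check that it is still available in the version you deploy: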

# aot_module.py
from numba.pycc import CC
cc = CC('compiled_ticks')

@cc.export('process_tick_data', 'f8[:](f8[:, :])')
def process_tick_data(data):
    ...

if __name__ == '__main__':
    cc.compile()

In practice, cache=True plus startup warmup resolves most production JIT latency issues. AOT compilation is reserved for environments where deployment constraints prevent JIT (e.g., strict container sandboxes).
